Abstract

In a shotgun proteomics experiment, proteins are the most biologically meaningful output. 
The success of proteomics studies depends on the ability to accurately and efficiently identify 
proteins. Many methods have been proposed to facilitate the identification of proteins from 
the results of peptide identification. However, the relationship between protein identification 
and peptide identification has not been thoroughly explained before. 

In this paper, we take a combinatorial perspective on the protein inference problem.
We use combinatorial mathematics to calculate conditional protein probabilities
(the protein probability is the probability that a protein is correctly identified) under
three assumptions, which lead to a lower bound, an upper bound and an empirical estimation
of protein probabilities, respectively. The combinatorial perspective enables us to obtain a
closed-form formulation for protein inference.

Based on our model, we study the impact of unique peptides and degenerate peptides
on protein probabilities. Here, degenerate peptides are peptides shared by at least two
proteins. We also study the relationship between our model and other methods such as
ProteinProphet. A probability confidence interval can be calculated and used together with
the protein probability to filter the protein identification result. In experiments on two
datasets of standard protein mixtures and two datasets of real samples, our method achieves
results competitive with ProteinProphet in a more efficient manner.

We name our program ProteinInfer. Its Java source code is available at:
http://bioinformatics.ust.hk/proteininfer

Introduction 

Proteomics studies gene and cellular function directly at the protein level.
In proteomics, mass spectrometry has been a primary tool for conducting high-throughput
experiments. In a typical shotgun proteomics experiment, proteins are digested into peptides by
enzymes and analyzed by a mass spectrometer to generate single-stage mass spectra. Some peptides
are fragmented into smaller ions and analyzed to produce tandem mass spectra. The need to identify
peptides from tandem mass spectra has led to the development of peptide identification methods.
Protein inference derives proteins from the peptide identification results. Conducting protein
identification in an accurate and high-throughput manner is a primary goal of proteomics.

Quantitative measurement of protein identification confidence has been a major concern in 
developing new protein inference models. The calculation of protein probability has become 
popular as it provides nice properties in terms of distinction and accuracy: 

• Distinction: Protein probability is a quantitative measurement of protein identification
confidence, so different proteins can be distinguished by their probabilities.

• Accuracy: Assigning a probability to each protein gives the protein identification result
a statistical interpretation, which makes the result more reliable.

To date, many statistical models for protein probability calculation have been proposed.
Readers may refer to a recent review for a detailed description of these methods. Depending on
the strategy used to obtain protein probabilities, they can be grouped into the following two
categories:

• Probability Models: Methods in this category partition the probability of each degenerate
peptide among its corresponding proteins. Degenerate peptides are peptides shared by at least
two proteins. The probability of a protein is then calculated as the probability that at least
one of its peptides is present. The partition weights are iteratively updated in an EM-like
algorithm.

• Bayesian Methods: Methods in this group model the mass spectrometry process generatively
within a rigorous Bayesian framework. The Bayesian models are complicated in their
formulation, and computational methods such as Markov chain Monte Carlo (MCMC) are needed to
obtain a solution. More importantly, methods such as MSBayesPro need extra information; their
performance depends on the reliability of that information, and the methods may not be
applicable under different conditions. Fido provides a way to calculate the marginal protein
probability based on the Bayesian formula. However, the computational complexity of the
method is exponential in the number of distinct peptides.

In this paper, we provide a combinatorial perspective of the protein inference problem to 
calculate protein probabilities as well as to understand the probability partition procedure. Our 
contributions can be described in the following aspects: 

• By computing the marginal protein probability, we obtain a concise protein inference model
with an analytical solution, whose computational complexity is linear in the number of
distinct peptides.

• We derive a lower bound and an upper bound of the protein probability. The two bounds define
a probability confidence interval, which is used in addition to the protein probability
when filtering the protein identification result.

• The impacts of unique peptides and degenerate peptides are studied mathematically based
on our model. We also discuss some promising ways to further improve the distinction of
protein inference.

• We discuss the connection of our method with other methods such as greedy methods and
ProteinProphet.



The rest of our paper is organized as follows: Section 2 describes the details of our method; 
Section 3 presents the experimental results; Section 4 discusses and concludes the paper. 

Method 

Figure 1: The data structure of a protein identification problem. We need to estimate protein
probabilities given peptide probabilities and the peptide-protein mapping. In the figure, peptide 1
and peptide 2 are degenerate peptides whereas peptide 3 and peptide M are unique peptides.

Peptides from the same protein are assumed to contribute independently to the presence of the
protein. This assumption is related to Naive Bayes, in which the features of an object contribute
independently to the object. From a statistical viewpoint, the independence assumption not only
simplifies the modeling procedure, but also leads to stable results when information such as the
inner structure of the data is insufficient. The presence and absence of peptides from different
proteins are also assumed to be independent.

Figure 1 shows the data structure of the protein inference problem. In a protein identification
result, there are many candidate proteins. For simplicity, we focus on one protein
to keep our notation concise.

Suppose a protein has M peptides. We use $\Pr(y=1)$ and $\Pr(y=0)$ to denote the probabilities
that the protein is present and absent, respectively. For peptide i, $\Pr(x_i=1)$ and
$\Pr(x_i=0)$ represent the probabilities that it is present and absent, respectively. Let $n_i$
be the number of proteins which share peptide i. When $n_i \ge 2$, peptide i is a degenerate
peptide; otherwise, it is a unique peptide. We use $\mathcal{P}=\{i \mid x_i=1\}$ and
$\mathcal{A}=\{i \mid x_i=0\}$ to denote the set of present peptide indices and the set of absent
peptide indices, respectively. For each pair of $\mathcal{P}$ and $\mathcal{A}$, we have
$\mathcal{P}\cup\mathcal{A}=\{1,2,\ldots,M\}$. When $i\in\mathcal{P}$, peptide i is present
(i.e. $x_i=1$). Similarly, $i\in\mathcal{A}$ means the absence of peptide i (i.e. $x_i=0$).
Denote the set $\{x_i \mid i\in\mathcal{P}\}$ and the set $\{x_i \mid i\in\mathcal{A}\}$ as
$X_{\mathcal{P}}$ and $X_{\mathcal{A}}$, respectively. The conditional probability of the absence
of the protein given the peptides is denoted as $\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})$.
Let $\Pr(X_{\mathcal{P}})$ and $\Pr(X_{\mathcal{A}})$ be $\prod_{i\in\mathcal{P}}\Pr(x_i)$ and
$\prod_{i\in\mathcal{A}}\Pr(x_i)$, respectively.

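Each peptide of a protein is assumed to contribute independently to the protein. By the law of
total probability, the probability of a protein being absent is calculated as:

\[
\begin{aligned}
\Pr(y=0) &= \sum_{x_1\in\{0,1\}}\sum_{x_2\in\{0,1\}}\cdots\sum_{x_M\in\{0,1\}}
            \Pr(y=0\mid x_1,x_2,\ldots,x_M)\prod_{i=1}^{M}\Pr(x_i) \\
         &= \sum_{(\mathcal{P},\mathcal{A})}
            \Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})\,\Pr(X_{\mathcal{P}})\,\Pr(X_{\mathcal{A}}).
\end{aligned}
\tag{1}
\]

The second sum runs over all ways of splitting the peptide indices into a present set
$\mathcal{P}$ and an absent set $\mathcal{A}$.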

To calculate the protein probability given by equation (1), we need the conditional
probability $\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})$. Directly computing the protein
probability from equation (1) requires $2^M$ operations. In the following sections, we provide a
way to calculate the protein probability with a number of operations that is linear in M.

When the set of present peptide indices $\mathcal{P}$ is empty:

\[
\Pr(y=0\mid X_{\mathcal{A}}) = \Pr(y=0\mid x_1=0,x_2=0,\ldots,x_M=0) = 1. \tag{2}
\]

When $\mathcal{P}$ is not empty, different assumptions lead to different results. In the following
sections, we calculate the conditional protein probabilities based on three different assumptions.
These three assumptions lead to an upper bound, a lower bound and an empirical estimation of
protein probability, respectively.

Conditional Probability Based on A Loose Assumption 

A loose assumption supposes that, when peptide i is detected, all corresponding proteins are
present. In other words, this peptide is contributed by all corresponding proteins. This assumption
is related to the one-hit rule used in protein inference.

When the set of present peptide indices $\mathcal{P}$ is not empty, the corresponding protein must
be present. Thus, the conditional absent probability is:

\[
\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}}) = 0. \tag{3}
\]

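The absent probability is calculated as:

\[
\begin{aligned}
\Pr(y=0) &= \prod_{i=1}^{M}\Pr(x_i=0)
            + \sum_{\mathcal{P}\neq\emptyset}
              \Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})\,\Pr(X_{\mathcal{P}})\,\Pr(X_{\mathcal{A}}) \\
         &= \prod_{i=1}^{M}\Pr(x_i=0) \\
         &= \prod_{i=1}^{M}\bigl(1-\Pr(x_i=1)\bigr).
\end{aligned}
\tag{4}
\]

The loose assumption leads to an upper bound of the protein probability:

\[
\mathrm{Pr}_U(y=1) = 1-\prod_{i=1}^{M}\bigl(1-\Pr(x_i=1)\bigr). \tag{5}
\]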

Conditional Probability Based on A Strict Assumption 

A strict assumption supposes that, if a peptide is detected, it comes from only one of the
proteins containing the peptide. Equivalently, the peptide occurs just once. This concept is
most closely related to greedy protein inference methods, which iteratively select the protein
that explains most of the remaining peptides and remove the peptides that have been explained.

When the set of present peptide indices $\mathcal{P}$ is not empty, the total number of ways to
explain the observed peptides (i.e. peptides with indices in $\mathcal{P}$) is given by:

\[
n_t = \prod_{i\in\mathcal{P}}\binom{n_i}{1} = \prod_{i\in\mathcal{P}} n_i. \tag{6}
\]

Here, $n_i$ is the number of proteins containing peptide i. When the protein is absent, the number
of proteins sharing peptide i is decreased by 1. Then, the total number of ways to explain the
observed peptides is:

\[
n_a = \prod_{i\in\mathcal{P}}\binom{n_i-1}{1} = \prod_{i\in\mathcal{P}} (n_i-1). \tag{7}
\]

The conditional probability is given by:

\[
\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}}) = \frac{n_a}{n_t}
  = \prod_{i\in\mathcal{P}}\frac{n_i-1}{n_i}. \tag{8}
\]

We can consider two special cases to get some intuitions from equation (8). 

Case 1: If any observed peptide is unique (i.e. $n_i=1$), then
$\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})=0$. This indicates that the protein must be present
if the corresponding unique peptide is observed.

Case 2: Suppose the present set is $\{1\}$ and the absent set is $\{2,3,\ldots,M\}$. If the first
peptide is shared by infinitely many proteins (i.e. $n_1\to\infty$), then we have:

\[
\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}}) = \lim_{n_1\to\infty}\frac{n_1-1}{n_1} = 1. \tag{9}
\]

Equation (9) indicates that, if the first peptide is shared by too many proteins, we cannot
determine which protein it comes from. The probability of the absence of the protein approaches 1.
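
When the conditional probability (8) is applied, we have:

\[
\begin{aligned}
\Pr(y=0) &= \sum_{(\mathcal{P},\mathcal{A})}
            \Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})\,\Pr(X_{\mathcal{P}})\,\Pr(X_{\mathcal{A}}) \\
         &= \prod_{i=1}^{M}\left(1-\frac{1}{n_i}\Pr(x_i=1)\right).
\end{aligned}
\tag{10}
\]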

Details are available in our supplementary document. 
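
The step from the sum over present/absent sets to the product in equation (10) can be sketched as follows. This is only our reading of the factorization, not a reproduction of the supplementary proof; the shorthand $p_i = \Pr(x_i = 1)$ is introduced here for brevity, and the empty product is taken to be 1, in line with equation (2).

```latex
\begin{aligned}
\Pr(y=0)
  &= \sum_{\mathcal{P}\subseteq\{1,\ldots,M\}}
     \prod_{i\in\mathcal{P}} \frac{n_i-1}{n_i}\,p_i
     \prod_{i\notin\mathcal{P}} (1-p_i) \\
  &= \prod_{i=1}^{M}\left[\frac{n_i-1}{n_i}\,p_i + (1-p_i)\right]
   = \prod_{i=1}^{M}\left(1-\frac{p_i}{n_i}\right).
\end{aligned}
```

The middle equality uses the fact that a sum over all subsets of a product of per-index factors equals the product of the per-index sums, which is what reduces the $2^M$ terms to M factors.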

This assumption is very strict and leads to a lower bound of protein probability:

\[
\mathrm{Pr}_L(y=1) = 1-\prod_{i=1}^{M}\left(1-\frac{1}{n_i}\Pr(x_i=1)\right). \tag{11}
\]



Conditional Probability Based on A Mild Assumption 

A mild assumption supposes that all proteins containing a peptide may generate it: the presence
of a peptide is contributed by either one or multiple proteins.

When the set of present peptide indices $\mathcal{P}$ is not empty, the total number of ways to
explain the observed peptides (i.e. peptides with indices in $\mathcal{P}$) is given by:

\[
n_t = \prod_{i\in\mathcal{P}}\sum_{k=1}^{n_i}\binom{n_i}{k}
    = \prod_{i\in\mathcal{P}}\bigl(2^{n_i}-1\bigr). \tag{12}
\]

Here, $n_i$ is the number of proteins that share peptide i. When the protein is absent, the number
of proteins sharing peptide i is $n_i-1$. Thus, the number of ways to explain the peptides above
is given by:

\[
n_a = \prod_{i\in\mathcal{P}}\bigl(2^{n_i-1}-1\bigr). \tag{13}
\]

Then, we have:

\[
\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}}) = \frac{n_a}{n_t}
  = \prod_{i\in\mathcal{P}}\frac{2^{n_i-1}-1}{2^{n_i}-1}. \tag{14}
\]

Similarly, let us consider two special cases to get some insights from equation (14).

Case 1: If any observed peptide is unique (i.e. $n_i=1$), then
$\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})=0$. The corresponding protein must be present to
explain this unique peptide.

Case 2: Suppose the present set is $\{1\}$ and the absent set is $\{2,3,\ldots,M\}$. If the first
peptide is shared by infinitely many proteins (i.e. $n_1\to\infty$), then we have:

\[
\Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})
  = \lim_{n_1\to\infty}\frac{2^{n_1-1}-1}{2^{n_1}-1} = \frac{1}{2}. \tag{15}
\]

Equation (15) has a meaningful interpretation. If peptide 1 is shared by infinitely many
proteins, determining the presence of a corresponding protein is like random guessing. The
probabilities of the presence and absence of this protein are therefore both 0.5.

The strict assumption is exclusive. If a degenerate peptide has already been explained, other 
proteins are not considered. In contrast, the mild assumption is inclusive. Explaining a degenerate 
peptide with one protein will not affect the presence of other proteins containing this degenerate 
peptide. Thus, we obtain different results in equation (9) and equation (15). 
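
The contrast between the two limits can be checked numerically. The sketch below evaluates the per-peptide factor of equation (8) and of equation (14) for a single observed peptide shared by n proteins; the function names are ours and purely illustrative.

```python
from fractions import Fraction


def strict_absent_ratio(n: int) -> Fraction:
    """Factor (n - 1) / n from equation (8) for one observed peptide shared by n proteins."""
    return Fraction(n - 1, n)


def mild_absent_ratio(n: int) -> Fraction:
    """Factor (2^(n-1) - 1) / (2^n - 1) from equation (14) for the same peptide."""
    return Fraction(2 ** (n - 1) - 1, 2 ** n - 1)


if __name__ == "__main__":
    # As n grows, the strict factor tends to 1 and the mild factor tends to 1/2.
    for n in (1, 2, 5, 20, 40):
        print(n, float(strict_absent_ratio(n)), float(mild_absent_ratio(n)))
```

Exact fractions are used so that the unique-peptide case (n = 1) gives exactly zero under both assumptions.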

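The absent probability based on the mild assumption reads:

\[
\begin{aligned}
\Pr(y=0) &= \sum_{(\mathcal{P},\mathcal{A})}
            \Pr(y=0\mid X_{\mathcal{P}},X_{\mathcal{A}})\,\Pr(X_{\mathcal{P}})\,\Pr(X_{\mathcal{A}}) \\
         &= \prod_{i=1}^{M}\left(1-\frac{2^{n_i}}{2(2^{n_i}-1)}\Pr(x_i=1)\right).
\end{aligned}
\tag{16}
\]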

The proof can be found in the supplementary document. 
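
As a numerical sanity check (not a substitute for the proof), the exponential sum of equation (1) with the mild conditional probability of equation (14) can be compared against the closed form of equation (16). The peptide probabilities and sharing counts below are made-up illustration values, and the function names are ours.

```python
from itertools import product
from math import isclose, prod


def mild_conditional_absent(present, n):
    """Equation (14): Pr(y=0 | X_P, X_A); an empty present set gives 1, as in equation (2)."""
    return prod((2 ** (n[i] - 1) - 1) / (2 ** n[i] - 1) for i in present)


def absent_by_enumeration(p, n):
    """Equation (1): sum over all 2^M presence/absence patterns of the M peptides."""
    total = 0.0
    for pattern in product((0, 1), repeat=len(p)):
        present = [i for i, x in enumerate(pattern) if x == 1]
        # Probability of this pattern under independent peptides.
        weight = prod(p[i] if x else 1.0 - p[i] for i, x in enumerate(pattern))
        total += mild_conditional_absent(present, n) * weight
    return total


def absent_closed_form(p, n):
    """Equation (16): product over peptides, linear in M."""
    return prod(1.0 - 2 ** ni / (2 * (2 ** ni - 1)) * pi for pi, ni in zip(p, n))


# Hypothetical input: four peptides shared by 1, 2, 3 and 5 proteins.
p = [0.9, 0.6, 0.75, 0.3]
n = [1, 2, 3, 5]

if __name__ == "__main__":
    print(absent_by_enumeration(p, n), absent_closed_form(p, n))
```

The enumeration visits 2^M patterns while the closed form needs M factors, which is the efficiency gain the model relies on.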

The assumption leads to an empirical estimation of protein probability:

\[
\mathrm{Pr}_E(y=1) = 1-\prod_{i=1}^{M}\left(1-\frac{2^{n_i}}{2(2^{n_i}-1)}\Pr(x_i=1)\right). \tag{17}
\]

Marginal Protein Probability 

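The relationship of the three protein probabilities based on the three assumptions above is:

\[
\begin{aligned}
\mathrm{Pr}_L(y=1) &= 1-\prod_{i=1}^{M}\left(1-\frac{1}{n_i}\Pr(x_i=1)\right) \\
 \le \mathrm{Pr}_E(y=1) &= 1-\prod_{i=1}^{M}\left(1-\frac{2^{n_i}}{2(2^{n_i}-1)}\Pr(x_i=1)\right) \\
 \le \mathrm{Pr}_U(y=1) &= 1-\prod_{i=1}^{M}\bigl(1-\Pr(x_i=1)\bigr).
\end{aligned}
\tag{18}
\]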

Here, $n_i \ge 1$ is the number of proteins that share peptide i. Readers can refer to our
supplementary document for the proof. The closed-form inequality (18) can be used to calculate
the lower bound, the empirical estimation and the upper bound of protein probability efficiently.
The total numbers of operations to calculate $\mathrm{Pr}_L$, $\mathrm{Pr}_E$ and $\mathrm{Pr}_U$
are linear with respect to the number of distinct peptides. Equality is achieved when all peptides
of the protein are unique peptides. The empirical protein probability $\mathrm{Pr}_E(y=1)$ is used
as the major factor for measuring the protein identification confidence. The difference between
the upper bound and the lower bound is:

\[
\mathrm{Pr}_D(y=1) = \mathrm{Pr}_U(y=1) - \mathrm{Pr}_L(y=1). \tag{19}
\]

The difference $\mathrm{Pr}_D(y=1)$ can be used to measure the confidence of the estimation. The
smaller the value of $\mathrm{Pr}_D(y=1)$, the higher the confidence. When all peptides are unique,
the probability estimation has no ambiguity and $\mathrm{Pr}_D(y=1)=0$. In this case, the
confidence is the highest.
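
The three closed-form probabilities and their difference can be computed in a single pass over the peptides. The following is a minimal sketch, assuming the inputs are a list of peptide probabilities `p` and a matching list of sharing counts `n`; the function name and the example values are ours, not taken from ProteinInfer.

```python
from math import prod


def protein_bounds(p, n):
    """Return (Pr_L, Pr_E, Pr_U, Pr_D) for one protein.

    p[i] is Pr(x_i = 1) and n[i] >= 1 is the number of proteins sharing peptide i.
    """
    lower = 1.0 - prod(1.0 - pi / ni for pi, ni in zip(p, n))
    empirical = 1.0 - prod(
        1.0 - 2 ** ni / (2 * (2 ** ni - 1)) * pi for pi, ni in zip(p, n)
    )
    upper = 1.0 - prod(1.0 - pi for pi in p)
    return lower, empirical, upper, upper - lower


if __name__ == "__main__":
    # Hypothetical protein: one unique peptide and two peptides shared with one other protein.
    lower, empirical, upper, width = protein_bounds([0.9, 0.9, 0.9], [1, 2, 2])
    print(lower, empirical, upper, width)
```

Each quantity is a product over M factors, so the cost is linear in the number of peptides of the protein.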

Unlike existing protein probability estimation methods, we have quantitative measurements
of protein identification confidence from different aspects. This makes it possible
to achieve superior distinction in the protein identification result.

Unique Peptides and Degenerate Peptides 

Unique peptides play a central role in protein identification. We are more confident in the
identification of a protein when more of its peptides are unique. According to our calculation
(18), the highest confidence of a protein is achieved when its peptides are all unique.

Degenerate peptides can increase the empirical protein probability $\mathrm{Pr}_E$. However, the
degree of the increase depends on how many proteins share the degenerate peptides.
To see this, let us consider a protein whose unique peptides have indices $\{1,2,\ldots,M-1\}$ and
whose peptide M is degenerate. According to equation (16), we have:

\[
\begin{aligned}
\Pr(y=0) &= \prod_{i=1}^{M}\left(1-\frac{2^{n_i}}{2(2^{n_i}-1)}\Pr(x_i=1)\right) \\
         &= \prod_{i=1}^{M-1}\bigl(1-\Pr(x_i=1)\bigr)
            \left(1-\frac{2^{n_M}}{2(2^{n_M}-1)}\Pr(x_M=1)\right) \\
         &\le \prod_{i=1}^{M-1}\Pr(x_i=0).
\end{aligned}
\tag{20}
\]

Thus, degenerate peptide M increases $\mathrm{Pr}_E$:

\[
\mathrm{Pr}_E(y=1) = 1-\Pr(y=0) \ge 1-\prod_{i=1}^{M-1}\Pr(x_i=0). \tag{21}
\]

The absent probability in equation (20) is monotonically increasing with respect to $n_M$, so
$\mathrm{Pr}_E$ is monotonically decreasing in $n_M$.

Degenerate peptide M introduces ambiguity in protein probability estimation. The confidence
interval is given by:

\[
\begin{aligned}
\mathrm{Pr}_D(y=1) &= \mathrm{Pr}_U(y=1)-\mathrm{Pr}_L(y=1) \\
                   &= \frac{n_M-1}{n_M}\,\Pr(x_M=1)\prod_{i=1}^{M-1}\Pr(x_i=0).
\end{aligned}
\tag{22}
\]

We can see that the ambiguity $\mathrm{Pr}_D$ increases when $n_M$ increases.

In conclusion, a degenerate peptide improves the empirical protein probability $\mathrm{Pr}_E$ and
introduces ambiguity in protein probability calculation (i.e. $\mathrm{Pr}_D \neq 0$). As $n_M$
increases, the increase in $\mathrm{Pr}_E$ becomes smaller and the confidence interval
$\mathrm{Pr}_D$ becomes larger.
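
The two trends can be seen on a toy protein with a few unique peptides and one degenerate peptide. The sketch below follows equations (20) and (22); the probabilities are invented for illustration and the function names are ours.

```python
from math import prod


def empirical_probability(p_unique, p_shared, n_shared):
    """Pr_E for unique peptides plus one degenerate peptide shared by n_shared proteins (eq. 20)."""
    absent_unique = prod(1.0 - pi for pi in p_unique)
    coefficient = 2 ** n_shared / (2 * (2 ** n_shared - 1))
    return 1.0 - absent_unique * (1.0 - coefficient * p_shared)


def interval_width(p_unique, p_shared, n_shared):
    """Pr_D = Pr_U - Pr_L for the same protein (eq. 22)."""
    absent_unique = prod(1.0 - pi for pi in p_unique)
    return absent_unique * (n_shared - 1) / n_shared * p_shared


if __name__ == "__main__":
    p_unique, p_shared = [0.6, 0.5], 0.9  # hypothetical peptide probabilities
    for n_shared in (2, 3, 5, 10):
        print(
            n_shared,
            empirical_probability(p_unique, p_shared, n_shared),
            interval_width(p_unique, p_shared, n_shared),
        )
```

Without the degenerate peptide the protein probability would be 1 - 0.4 * 0.5 = 0.8; every row printed above stays above that value, while the gain shrinks and the interval widens as the sharing count grows.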



The Relationship with ProteinProphet 

In our model, the loose assumption and the strict assumption relate to the one-hit rule and greedy
methods, respectively. In this section, we discuss the relationship of our method with
ProteinProphet.

ProteinProphet calculates the protein probability as the probability that at least one of its
peptides is present. When processing a degenerate peptide, ProteinProphet apportions this peptide
among all proteins which share it. The protein with the higher probability is assigned more
weight. The probability of the absence of the protein is then formulated as:

\[
\Pr(y=0) = \prod_{i=1}^{M}\bigl(1-w_i\Pr(x_i=1)\bigr). \tag{23}
\]

Here, $w_i$ is the weight assigned to peptide i of the protein. The assumption that the protein
is present when at least one of its peptides is present is questionable when degenerate peptides
are detected. Thus, ProteinProphet intuitively partitions the probability of degenerate peptide i
according to weight $w_i$. Then, a dummy peptide with probability $w_i\Pr(x_i)$ is assumed to be
unique. Dummy peptides together with the original unique peptides are all unique. In this
situation, the assumption that the protein is present if any peptide is detected is correct,
because the protein must be present to explain its unique dummy or original peptides.

By comparing the ProteinProphet protein scoring function with our model (18), we can see
that they are closely related. In the initial stage, ProteinProphet evenly apportions degenerate
peptides among all corresponding proteins (i.e. $w_i = 1/n_i$). That is exactly the lower bound
$\mathrm{Pr}_L$ in our model. In the end, low-confidence proteins tend to receive less weight
whereas high-confidence proteins tend to receive more weight. Generally, for high-confidence
proteins, we have $1/n_i \le w_i \le 1$. Thus, the probabilities of these proteins estimated by
ProteinProphet will be within the bounds of our model. Although ProteinProphet is not strictly
derived from basic probability theory, its formulation coincides with our model (18). This
explains the popularity and good performance of ProteinProphet in real applications.
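
The relation between the weighted score of equation (23) and the bounds can be illustrated with a few lines of code. This sketch only evaluates the scoring formula as stated in the text; it does not implement ProteinProphet's iterative weight update, and the "converged" weights below are hypothetical values chosen to lie between 1/n_i and 1.

```python
from math import prod


def weighted_probability(p, w):
    """1 - prod(1 - w_i * Pr(x_i = 1)), i.e. one minus the absent probability of eq. (23)."""
    return 1.0 - prod(1.0 - wi * pi for pi, wi in zip(p, w))


# Hypothetical protein: peptide probabilities and number of proteins sharing each peptide.
p = [0.9, 0.8, 0.7]
n = [1, 2, 4]

lower = weighted_probability(p, [1.0 / ni for ni in n])  # even split, w_i = 1/n_i
upper = weighted_probability(p, [1.0] * len(p))          # full weight, w_i = 1
between = weighted_probability(p, [1.0, 0.7, 0.4])       # assumed weights with 1/n_i <= w_i <= 1

if __name__ == "__main__":
    print(lower, between, upper)
```

Because each factor 1 - w_i * p_i decreases as w_i grows, any weight vector with 1/n_i <= w_i <= 1 yields a probability between the two bounds.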

Results 

In this section, we first describe our experimental settings such as datasets and tools used in the
experiments. Then, we illustrate the unique peptide probability adjustment, which is a
preprocessing step in protein inference. Next, we explain the reporting format of our protein
identification result. Finally, we present and discuss the experimental results.

Experimental Settings 

The evaluation of our method is conducted on four publicly available datasets: ISB, Sigma49,
Human and Yeast. The ISB dataset was generated from a standard protein mixture which contains
18 proteins. The sample was analyzed on a Waters/Micromass Q-TOF using an electrospray
source. The Sigma49 dataset was acquired by analyzing 49 standard proteins on a Thermo LTQ
instrument. The Human dataset was obtained by analyzing human blood serum samples with
Thermo LTQ. The Yeast dataset was obtained by analyzing cell lysate on both LCQ and ORBI
mass spectrometers from wild-type yeast grown in rich medium. The information of each dataset
is shown in Table 1.

Table 1: Names and URLs of Data Files.

Dataset   File Name                                    URL
ISB       QT20051230_S_18mix_04.mzXML                  http://regis-web.systemsbiology.net/PublicDatasets/
Sigma49   Lane/060121Yrasprg051025ct5.RAW              https://proteomecommons.org/dataset.jsp?i=71610
Human     PAe000330/021505_LTQ10401_l_2.mzXML          http://www.peptideatlas.org/repository/
Yeast     YPD_ORBI/061220.zl.mudpit0.1.1/raw/OOO.RAW   http://aug.csres.utexas.edu/msnet/

When analyzing the ISB dataset and the Sigma49 dataset, we use the curve of false positives
versus true positives to evaluate the performance. The ground truth of the ISB dataset and the
Sigma49 dataset contains 18 and 49 proteins, respectively. A protein identification is a true
positive if the protein belongs to the ground truth. Otherwise, it is a false positive. Given the
same number of false positives, more true positives mean a better performance. When analyzing the
Human dataset and the Yeast dataset, we use the curve of decoys versus targets because the
ground truth is not known in advance. Given the same number of decoys, the more the targets,
the better the performance. We also plot the curves of decoys versus targets for the ISB dataset
and the Sigma49 dataset.
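
Both kinds of curve can be produced the same way: walk down the protein list in order of decreasing probability and count how many entries are "true" (ground-truth proteins or targets) and how many are "false" (false positives or decoys). The sketch below assumes each protein is a `(name, probability, is_true)` tuple; the data and function names are hypothetical.

```python
def cumulative_curve(ranked_labels):
    """Return (false_count, true_count) after each protein in a ranked list of booleans."""
    points = []
    true_count = false_count = 0
    for is_true in ranked_labels:
        if is_true:
            true_count += 1
        else:
            false_count += 1
        points.append((false_count, true_count))
    return points


def true_at_false(points, max_false):
    """Largest number of true entries reached with at most max_false false entries."""
    return max((t for f, t in points if f <= max_false), default=0)


# Hypothetical scored proteins: True marks a ground-truth protein (or a target).
scored = [
    ("P1", 0.99, True),
    ("D1", 0.90, False),
    ("P2", 0.97, True),
    ("D2", 0.40, False),
    ("P3", 0.85, True),
]
ranked = [label for _, _, label in sorted(scored, key=lambda r: r[1], reverse=True)]
curve = cumulative_curve(ranked)

if __name__ == "__main__":
    print(curve)
    print(true_at_false(curve, 1))
```

Comparing two methods at the same `max_false` value corresponds to reading both curves at the same position on the false-positive (or decoy) axis.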

The database we use is a target-decoy concatenated protein database, which contains 1048840
proteins. The decoys are obtained by reversing protein sequences of UniProtKB/Swiss-Prot
(Release 2011_01). In our experiments, X!Tandem (Version 2010.10.01.1) is employed to identify
peptides from each dataset. In the database search, the parameters "fragment monoisotopic mass
error", "parent monoisotopic mass error plus" and "parent monoisotopic mass error minus" are
set to 0.4Da, 2Da and 4Da, respectively. The number of missed cleavages permitted is 1.
Then, PeptideProphet and iProphet embedded in TPP (Version v4.5 RAPTURE rev 2, Build
201202031108 (MinGW)) are used to estimate peptide probabilities. Finally, ProteinProphet
and our method are applied to estimate protein probabilities. The performance of ProteinProphet
is compared to that of our method.
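
Building a concatenated target-decoy database by sequence reversal can be sketched as below. The `DECOY_` prefix, the accession names and the short sequences are assumptions made for illustration; the text does not state how decoy entries were named.

```python
def add_reversed_decoys(targets, prefix="DECOY_"):
    """Return a database holding every target sequence plus its reversed decoy.

    targets maps an accession to an amino-acid sequence; prefix is a hypothetical
    naming convention used to tell decoys apart from targets.
    """
    database = dict(targets)
    for accession, sequence in targets.items():
        database[prefix + accession] = sequence[::-1]
    return database


# Hypothetical two-protein target database.
targets = {"P_EXAMPLE1": "MKWVTFISLL", "P_EXAMPLE2": "MADEEKLPPGW"}
database = add_reversed_decoys(targets)

if __name__ == "__main__":
    for accession, sequence in database.items():
        print(accession, sequence)
```

The concatenated database has twice as many entries as the target database, and a reversed decoy keeps the length and amino-acid composition of its target.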

Adjust Unique Peptide Probabilities

Protein inference models take the peptide identification results as input. If the peptide
probability estimation is perfect, peptide probability adjustment is not essential. However,
inaccurate peptide probabilities always exist.

Unique peptides are important for protein identification. A confident but misidentified unique
peptide (e.g. $\Pr(x=1)=0.99$) will result in a high-confidence protein identification with a high
$\mathrm{Pr}_E$ and a low $\mathrm{Pr}_D$. For example, if a tandem mass spectrum is matched to a
decoy peptide, the peptide is very likely to be unique. The unique high-confidence decoy peptide
will lead the decoy protein to be identified with high confidence. This motivates the procedure of
unique peptide adjustment as a preprocessing step of our method.

Suppose a protein has m unique peptides. The adjusted unique peptide probability can be
calculated as:

\[
\Pr(x_i=1\mid m) =
  \frac{\Pr(m\mid x_i=1)\Pr(x_i=1)}
       {\Pr(m\mid x_i=1)\Pr(x_i=1)+\Pr(m\mid x_i=0)\Pr(x_i=0)}. \tag{24}
\]

Here, peptide i is a unique peptide; $\Pr(x_i=1)$ is the probability that the unique peptide is
true. The terms $\Pr(m\mid x_i=1)$ and $\Pr(m\mid x_i=0)$ describe the probabilities of observing
m unique peptides of the protein when the unique peptide i is a true and a false identification,
respectively. We model $\Pr(m\mid x_i=1)$ and $\Pr(m\mid x_i=0)$ as Poisson distributions with
different expected numbers of unique peptides (i.e. $\lambda_1$ and $\lambda_2$):

\[
\Pr(m\mid x_i=1) = \frac{\lambda_1^{m}e^{-\lambda_1}}{m!}, \qquad
\Pr(m\mid x_i=0) = \frac{\lambda_2^{m}e^{-\lambda_2}}{m!}. \tag{25}
\]

Generally, a true unique peptide tends to have more sibling unique peptides than a false unique
peptide on average. Thus, we have $\lambda_1 > \lambda_2$. In our program, these two parameters
can be manually specified. Alternatively, these two parameters can be obtained empirically.

Suppose there are N candidate proteins and the number of unique peptides of protein j
($j\in\{1,2,\ldots,N\}$) is $m_j$. The empirical value of the expected value $\lambda_1$ is
estimated as:

\[
\lambda_1 = \frac{\sum_{j=1}^{N} m_j\, I(m_j \ge 2)}{\sum_{j=1}^{N} I(m_j \ge 2)}. \tag{26}
\]

Here, $I(\cdot)$ is an indicator function with value being either 0 or 1.

Empirically, $\lambda_2$ can be 1. It is common to observe that false proteins such as decoy
proteins have a single unique peptide.

The adjusted unique peptide probability $\Pr(x_i=1\mid m)$ is used as the probability of peptide i
in our model (18).
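
The adjustment is a two-class Bayes update with Poisson likelihoods, which can be sketched as follows. The function names are ours; `estimate_lambda_true` assumes that equation (26) averages the unique-peptide counts over proteins with at least two unique peptides, which is our reading of the formula rather than a confirmed detail of the program.

```python
from math import exp, factorial


def poisson_pmf(m, lam):
    """Poisson probability of observing m events with expected count lam (eq. 25)."""
    return lam ** m * exp(-lam) / factorial(m)


def adjust_unique_probability(p, m, lam_true, lam_false):
    """Posterior Pr(x_i = 1 | m) for a unique peptide with prior probability p (eq. 24)."""
    numerator = poisson_pmf(m, lam_true) * p
    denominator = numerator + poisson_pmf(m, lam_false) * (1.0 - p)
    return numerator / denominator


def estimate_lambda_true(unique_counts):
    """Assumed form of eq. (26): mean of m_j over proteins with m_j >= 2."""
    kept = [m for m in unique_counts if m >= 2]
    if not kept:
        raise ValueError("no protein has at least two unique peptides")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # A lone, high-scoring unique peptide is strongly down-weighted ...
    print(adjust_unique_probability(0.99, 1, lam_true=12.0, lam_false=1.0))
    # ... while the same score on a protein with 12 unique peptides is kept.
    print(adjust_unique_probability(0.99, 12, lam_true=12.0, lam_false=1.0))
```

With equal expected counts for true and false peptides the likelihood ratio is 1 and the probability is returned unchanged, so the adjustment only acts through the gap between the two Poisson means.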

The Protein Identification Result 

There are three different kinds of relationships between two proteins: 

• Indistinguishable: If two proteins contain exactly the same set of identified peptides, they 
are indistinguishable. Indistinguishable proteins can be treated as a group. 

• Subset: Identified peptides of a protein form a peptide set. If the peptide set of a protein is 
the subset of the other, the former protein is a subset protein of the latter. 

• Differentiable: Two proteins are differentiable if each of them contains peptides that are
not contained in the other.
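
The three relationships map directly onto set comparisons of the identified peptides. A minimal sketch, with hypothetical peptide identifiers and a function name of our choosing:

```python
def relationship(peptides_a, peptides_b):
    """Classify two proteins by comparing their sets of identified peptides."""
    a, b = set(peptides_a), set(peptides_b)
    if a == b:
        return "indistinguishable"
    if a < b:  # proper subset
        return "first is a subset protein of second"
    if b < a:
        return "second is a subset protein of first"
    # Each protein has at least one peptide the other lacks.
    return "differentiable"


if __name__ == "__main__":
    protein_1 = {"pep1", "pep2", "pep3"}
    protein_2 = {"pep1", "pep2"}
    protein_3 = {"pep1", "pep2", "pep3"}
    protein_4 = {"pep3", "pep4"}
    print(relationship(protein_1, protein_3))
    print(relationship(protein_2, protein_1))
    print(relationship(protein_1, protein_4))
```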

In the literature, subset proteins are generally discarded. However, it is not reasonable to
regard all subset proteins as being absent from a statistical point of view. Thus, we also calculate
the protein probability of subset proteins and organize our result in two separate files as shown
in Table 2. In the table, proteins 1 and 3 are indistinguishable proteins and protein 2 is a subset
protein of protein 1.

Table 2: Subset and Non-subset Proteins.

Non-subset Proteins
Index   Protein   Pr_E        Pr_L        Pr_U        Pr_D        Other Proteins
1       1         Pr_E(y_1)   Pr_L(y_1)   Pr_U(y_1)   Pr_D(y_1)   3
2       4         Pr_E(y_4)   Pr_L(y_4)   Pr_U(y_4)   Pr_D(y_4)

Subset Proteins
Index   Protein   Pr_E        Pr_L        Pr_U        Pr_D        Subset of Protein
1       2         Pr_E(y_2)   Pr_L(y_2)   Pr_U(y_2)   Pr_D(y_2)   1

The empirical probability $\mathrm{Pr}_E$ and the bound $\mathrm{Pr}_D$ quantitatively describe the
confidence of a protein. Generally, a high $\mathrm{Pr}_E$ and a low $\mathrm{Pr}_D$ mean a
confident protein. The identification result is mainly sorted by the empirical protein probability
$\mathrm{Pr}_E$. In case two proteins have the same $\mathrm{Pr}_E$, the order is then determined
by $\mathrm{Pr}_D$.
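
This ordering amounts to a two-key sort: decreasing empirical probability first, increasing interval width second. The record layout, field names and values below are hypothetical and are not the output format of the program.

```python
def rank_proteins(results):
    """Sort by empirical probability (descending), breaking ties by interval width (ascending)."""
    return sorted(results, key=lambda r: (-r["pr_e"], r["pr_d"]))


# Hypothetical results for four proteins.
results = [
    {"protein": "B", "pr_e": 0.84, "pr_d": 0.29},
    {"protein": "C", "pr_e": 0.98, "pr_d": 0.10},
    {"protein": "A", "pr_e": 0.98, "pr_d": 0.03},
    {"protein": "D", "pr_e": 0.99, "pr_d": 0.00},
]

if __name__ == "__main__":
    for row in rank_proteins(results):
        print(row["protein"], row["pr_e"], row["pr_d"])
```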

One purpose of protein identification is to select proteins to explain observed peptides. Subset
proteins do not increase the peptide explanation power. In general, we can perform downstream
analysis by only using non-subset proteins. However, from a probabilistic point of view, the
probability of a subset protein is not necessarily smaller than that of a non-subset protein.
Data explanation power and protein probability are two different notions. Without
any prior knowledge, choosing proteins from the data explanation viewpoint is safe. However, in
some cases, we may consider high-confidence subset proteins according to the prior knowledge
we have, for instance when the sample contains homogeneous proteins and it is possible that
subset proteins are present.

Figure 2: An example of the protein identification result. In the figure, there are two proteins
and three peptides with corresponding probabilities all being 0.9. Protein 2 is a subset protein
of protein 1. The probability $\mathrm{Pr}_E$ and the bound $\mathrm{Pr}_D$ are two quantitative
measurements of the confidence of a protein. The higher the value of $\mathrm{Pr}_E$ and the
smaller the value of $\mathrm{Pr}_D$, the more confident the protein. Protein 1 is a confident
protein with a high empirical probability $\mathrm{Pr}_E = 0.984$ and a tight bound
$\mathrm{Pr}_D = 0.029$. For protein 2, there are two peptides present. Without
any prior knowledge, we cannot determine the presence of protein 2 mathematically. From the
data explanation aspect, we can report protein 1 only. Protein 2 can be considered if the protein
coverage is a concern and homogeneous proteins are known to be present.




An example is shown in Figure 2 for illustration. Protein 1 is more confident than
protein 2. From the data explanation viewpoint, protein 1 is present whereas protein 2 is absent.
This is because including protein 2 in the final protein list will not improve the data explanation
efficiency. When we know that homogeneous proteins are present and desire more proteins,
we can merge subset and non-subset proteins to obtain the final result by filtering proteins with
thresholds on $\mathrm{Pr}_E$ and $\mathrm{Pr}_D$.

In our experiment, only non-subset proteins are considered in the comparison study. 

Protein Identification Results on Four Datasets



Figure 3: Protein identification results on four datasets. In (a) and (b), the curves of false
positives versus true positives for the ISB dataset and the Sigma49 dataset are plotted; in (c)
and (d), we show the curves of decoys versus targets for the ISB dataset and the Sigma49 dataset;
in (e) and (f), the curves of decoys versus targets for the Human dataset and the Yeast dataset
are shown. Because the performances measured by different validation methods may differ from each
other, we also draw the curves of decoys versus targets for the ISB dataset and the Sigma49
dataset.

The protein identification results on four datasets are shown in Figure 3. Our method outperforms
ProteinProphet in Figure 3(a) and achieves a comparable result with ProteinProphet in Figure
3(b). When using the curve of the decoy number versus the target number, our method clearly
outperforms ProteinProphet in Figure 3(c)-(f).

Table 3: Running time of our program compared with ProteinProphet. The total running time 
includes loading the peptide identification result, estimating protein probabilities and reporting 
the final result. 



Program          ISB       Sigma49   Human     Yeast
ProteinInfer     0.340s    0.478s    1.057s    0.920s
ProteinProphet   14.273s   15.103s   16.473s   14.710s



All formulations of our method are closed-form. Thus, our method can calculate protein
probabilities efficiently. In Table 3, we show the running time of our method compared with
ProteinProphet. The comparison is conducted on a computer with 4GB memory and an Intel(R)
Core(TM) i5-2500 CPU running the 32-bit Windows 7 operating system. The final result shown
in the table is the average running time of ten runs. The total running time includes loading the
peptide identification result, estimating protein probabilities and reporting the result. The
comparison of running time indicates the efficiency of our method in calculating protein
probabilities.

According to the experimental results on four publicly available datasets, our method achieves
competitive performance with ProteinProphet in a more efficient manner.

The Parameter Issue 

In the preprocessing step, there are two parameters, $\lambda_1$ and $\lambda_2$, corresponding to
the expected number of unique peptides of true proteins and false proteins, respectively. These two
parameters are estimated empirically from the data. Alternatively, they can be set manually.
Here, we compare the performance of our method under different parameter settings to show
whether our method is sensitive to the parameter setting.

In this section, we conduct our experiment on the ISB dataset and the Sigma49 dataset. Both
datasets have ground truth, so the impact of each parameter setting can be measured accurately.
Table 4: Parameter settings. The default empirical parameter setting is marked with "*". In the
experiment on each dataset, the performance based on the default parameter setting is shown for
reference.



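Index   ISB                                   Sigma49
1       $\lambda_1 = 5$,  $\lambda_2 = 1$     $\lambda_1 = 2$, $\lambda_2 = 1$
2       $\lambda_1 = 15$, $\lambda_2 = 1$     $\lambda_1 = 9$, $\lambda_2 = 1$
3       $\lambda_1 = 10$, $\lambda_2 = 1$     $\lambda_1 = 5$, $\lambda_2 = 1$
4       $\lambda_1 = 12$, $\lambda_2 = 5$     $\lambda_1 = 7$, $\lambda_2 = 3$
5       $\lambda_1 = 12$, $\lambda_2 = 10$    $\lambda_1 = 7$, $\lambda_2 = 5$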


The empirical estimates of $\lambda_1$ for the ISB dataset and the Sigma49 dataset are 12 and 7,
respectively. The parameter settings we consider are shown in Table 4. Under each parameter
setting, we use the curve of false positives versus true positives to measure the corresponding
performance of our method. The curve obtained with the empirical parameters is taken as the
reference. We calculate the correlation of each other curve with the reference to illustrate how
the performance varies under different conditions. Figure 4 shows the results.
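
The text does not state which correlation coefficient is used or how the curves are aligned. As a minimal sketch, assuming Pearson correlation and assuming that each curve is sampled as the number of true positives at the same false-positive counts, the comparison could look as follows. The function name and the sample numbers are hypothetical and are not taken from our experiments.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("curves must have the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant curve")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical true-positive counts at 0, 1, 2, 3 and 4 false positives.
reference = [10, 14, 16, 17, 18]  # curve under the empirical parameters
candidate = [9, 14, 15, 17, 18]   # curve under another parameter setting

print(f"correlation with reference: {pearson(reference, candidate):.4f}")
```

A value close to one under this assumption would indicate that the curve for the alternative setting follows the reference curve closely.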
Figure 4: The performances of our method on the ISB dataset and the Sigma49 dataset under
different parameter settings shown in Table 4.

In our model, we require that $\lambda_1 > \lambda_2$. We also conduct experiments to show what
happens when the parameters are misspecified (i.e. $\lambda_1 < \lambda_2$). The result is shown in Figure 5.

Figure 5: The performances of our method when parameters are set as $\lambda_1 < \lambda_2$.

According to the results in the two figures, our method is not sensitive to the parameter setting
as long as $\lambda_1 > \lambda_2$. A wrong parameter setting ($\lambda_1 < \lambda_2$), in contrast,
has a great impact on the protein identification result. Thus, we do not allow
$\lambda_1 < \lambda_2$ in our program.
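
Such a constraint can be enforced before any probability is computed. The following Python sketch is only an assumption about how a check of this kind might look; the function name `check_parameters` is hypothetical and does not describe the actual Java source.

```python
def check_parameters(lambda1, lambda2):
    """Reject settings that violate the model requirement lambda1 > lambda2."""
    if lambda1 <= lambda2:
        raise ValueError(
            f"lambda1 ({lambda1}) must be greater than lambda2 ({lambda2})"
        )
    return lambda1, lambda2

# A setting from Table 4 for the ISB dataset passes the check.
check_parameters(12, 5)
```

The sketch also rejects the boundary case of equal parameters, because the model requirement is stated as a strict inequality.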

More on Unique Peptide Probability Adjustment 

From the previous experiment, we can see that our method is not sensitive to the parameter setting.
The only requirement is that $\lambda_1 > \lambda_2$. According to Figure 5, the performances under
different parameters are close. Readers may therefore wonder why we adjust peptide
probabilities at all.
Table 5: The protein probabilities of decoy proteins before and after peptide probability
adjustment. The last column is the rank position of the protein in the corresponding result. In the
table, the $\Pr_D$ values are 0.0000 because peptides of decoy proteins tend to be unique. According
to our inequality (18), the lower bound and the upper bound of the protein probability coincide
when all peptides are unique.



Probabilities Without Adjustment

Index   Dataset   Protein        $\Pr_E$   $\Pr_D$   Rank
1       ISB       decoy_499748   0.9971    0.0000    46
2       ISB       decoy_237394   0.9243    0.0000    54
3       ISB       decoy_224201   0.8864    0.0000    55
4       Sigma49   decoy_519997   0.9517    0.0000    62
5       Sigma49   decoy_170817   0.9417    0.0000    64
6       Sigma49   decoy_271930   0.8248    0.0000    74

Probabilities With Adjustment

Index   Dataset   Protein        $\Pr_E$   $\Pr_D$   Rank
1       ISB       decoy_499748   0.0650    0.0000    50
2       ISB       decoy_237394   0.0024    0.0000    54
3       ISB       decoy_224201   0.0016    0.0000    55
4       Sigma49   decoy_519997   0.2548    0.0000    72
5       Sigma49   decoy_170817   0.2190    0.0000    73
6       Sigma49   decoy_271930   0.0755    0.0000    77



For illustration, let us consider the top three decoy proteins in the experiments on the ISB
dataset and the Sigma49 dataset. Table 5 shows the protein probabilities of these decoy
proteins before and after peptide probability adjustment. After adjustment, the protein
probabilities of all six decoy proteins decrease, and four of them move to a lower rank position
while the other two keep their rank. Decoy proteins are representative of a kind of error in
protein identification. It is undesirable to detect a decoy protein with high confidence (e.g.
decoy_499748 is detected with $\Pr_E = 0.9971$ and $\Pr_D = 0.0000$ before adjustment). In this
sense, the protein identification result becomes more meaningful after unique peptide adjustment.
Highly confident decoy proteins are detected because their corresponding decoy peptides are
detected with high confidence. Since peptide probability calculation is not perfect, we need to
adjust it to achieve a more meaningful protein identification result.

In conclusion, unique peptide probability adjustment can improve the protein identification
result (i.e. decoy proteins are ranked behind more target proteins) and make the result more
meaningful than the result before adjustment. Thus, keeping the adjustment procedure in our
program is essential.
Discussions and Conclusions



Protein Probability Interval 

The protein probability interval $\Pr_D$ can be used to improve the distinction of protein
identification results as well as to filter protein identification results.

Table 6: The number of indistinguishable proteins based on probabilities without and with $\Pr_D$.

Dataset   Without $\Pr_D$   With $\Pr_D$
ISB       27                21
Sigma49   38                36
Human     60                58
Yeast     181               178



Table 6 shows the numbers of indistinguishable proteins (based on the protein probability)
without and with $\Pr_D$ on four datasets. From the result, we can see that $\Pr_D$ decreases the
number of indistinguishable proteins.
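
The exact definition of an indistinguishable protein is not spelled out here. One plausible reading, used in the sketch below as an assumption, is that a protein is indistinguishable when at least one other protein has exactly the same reported score; adding the probability interval as a second component of the score can then only split such groups. The function name and the toy values are hypothetical.

```python
from collections import Counter

def count_indistinguishable(scores):
    """Count the items whose score is shared with at least one other item."""
    counts = Counter(scores)
    return sum(n for n in counts.values() if n > 1)

# Toy proteins as (empirical probability, probability interval), four decimals.
proteins = [(1.0000, 0.0000), (1.0000, 0.0000), (1.0000, 0.0003), (0.9500, 0.0100)]

without_interval = count_indistinguishable([p for p, _ in proteins])
with_interval = count_indistinguishable(proteins)
print(without_interval, with_interval)
```

In this toy input, three proteins tie on the probability alone, and the interval separates one of them from the other two.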

Table 7: The number of subset proteins without and with $\Pr_D$. In the table, "&&" means logical
"AND".

Dataset   $\Pr_E > 0.9$   $\Pr_E > 0.9$ && $\Pr_D < 0.02$
ISB       325             15
Sigma49   110             4
Human     12              4
Yeast     126




More importantly, $\Pr_D$ can be used as an extra filtering criterion when $\Pr_E$ alone does not
work effectively. This is very useful when subset proteins are considered (e.g. when the protein
identification rate is not satisfactory and homologous proteins are known to be present). Table
7 shows the number of subset proteins when using $\Pr_E > 0.9$ and $\Pr_E > 0.9$ && $\Pr_D < 0.02$ as
filters, respectively. Here, "&&" means logical "AND". A great number of subset proteins can be
filtered out by using $\Pr_D$. Thus, $\Pr_E$ and $\Pr_D$ together form an effective filter to pick
confident proteins.
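
The combined filter is a simple conjunction of two thresholds. The following is a minimal Python sketch, assuming each protein is available as a record holding its empirical probability and its probability interval. The record type, field names and sample values are hypothetical; the thresholds 0.9 and 0.02 are those used in Table 7.

```python
from typing import NamedTuple

class Protein(NamedTuple):
    name: str
    pr_e: float  # empirical protein probability
    pr_d: float  # protein probability interval

def confident(proteins, min_pr_e=0.9, max_pr_d=0.02):
    """Keep proteins with pr_e > min_pr_e AND pr_d < max_pr_d."""
    return [p for p in proteins if p.pr_e > min_pr_e and p.pr_d < max_pr_d]

candidates = [
    Protein("A", 0.99, 0.001),  # passes both thresholds
    Protein("B", 0.95, 0.300),  # high probability but wide interval
    Protein("C", 0.40, 0.000),  # narrow interval but low probability
]
print([p.name for p in confident(candidates)])
```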

More on the Distinction of Protein Identification Results 

When calculating protein probabilities, we often find that many proteins are assigned the
maximal score of one. This phenomenon can be explained with our model by considering the
following example:

• Suppose a protein has three unique peptides, each with probability 0.97.

• The protein probability based on the three unique peptides is $1 - (1 - 0.97)^3 = 0.999973$.

• Unique peptides are important to protein inference. According to equation (21), any extra
identified peptides will further increase the confidence of the protein. Thus, we have
$0.999973 < \Pr_E < 1.0$ and $0 < \Pr_D < 0.000027$. When only four decimal places are shown,
we will have $\Pr_E = 1.0000$ and $\Pr_D = 0.0000$. In practice, many proteins are assigned
probability 1.0 because such small numeric differences are ignored.
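
The arithmetic in this example is easy to reproduce. The Python lines below compute the probability from three unique peptides and show how both values collapse once they are printed with four decimal places. The variable names are ours.

```python
peptide_prob = 0.97
n_unique = 3

# Protein probability from the unique peptides alone: 1 - (1 - 0.97)^3.
lower = 1 - (1 - peptide_prob) ** n_unique
# Largest possible width of the probability interval, since the upper end is at most 1.
max_width = 1 - lower

print(f"{lower:.6f}")      # six decimals keep the difference from 1 visible
print(f"{lower:.4f}")      # four decimals round the value up to 1.0000
print(f"{max_width:.4f}")  # four decimals round the width down to 0.0000
```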

The poor distinction is mainly caused by these negligible numeric differences. Considering the
importance of unique peptides in protein inference, the distinction can be improved by sorting
proteins with score one according to the number of unique peptides in descending order. The result
is shown in Figure 6. In the figure, the performance of "ProteinInfer+Unique" is obtained by
considering the number of unique peptides as extra information in determining the order of
proteins. This trick can be used when reporting the protein identification result.
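
One way to realise this ordering is a two-level sort key. The sketch below is an assumption about the reporting step, not a description of the actual implementation: it sorts by the probability as reported with four decimal places and breaks ties by the number of unique peptides, so the rule is applied to every tie rather than only to proteins with score one. All names and values are hypothetical.

```python
proteins = [
    # (name, protein probability, number of unique peptides)
    ("P1", 0.999973, 3),
    ("P2", 0.999999, 8),
    ("P3", 0.9500, 5),
    ("P4", 0.999990, 5),
]

def report_order(items, decimals=4):
    """Sort by rounded probability, then by unique-peptide count, both descending."""
    return sorted(items, key=lambda p: (-round(p[1], decimals), -p[2]))

for name, prob, n_unique in report_order(proteins):
    print(f"{name}\t{prob:.4f}\t{n_unique}")
```

Here P1, P2 and P4 all print as 1.0000, so their relative order is decided by the unique-peptide count alone.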

Figure 6: The performance of "ProteinInfer+Unique" is obtained by considering the number of
unique peptides as extra information in determining the order of proteins.
However, this strategy does not work when two confident proteins have the same number of
unique peptides. The key factor in estimating distinct protein probabilities is the peptide
probability calculation. Unique peptides are important to protein inference, and degenerate
peptides will further increase the confidence of a protein. If the probability of a protein
computed from its unique peptides is high, the protein is necessarily reported as confident. Thus,
we need to estimate peptide probabilities conservatively, especially for unique peptides.

Conclusion 

In this paper, we propose a combinatorial perspective of the protein inference problem. From 
this perspective, we obtain the closed-form formulations of the lower bound, the upper bound 
and the empirical estimations of protein probability. Based on our model, we study an intrinsic 
property of protein inference: unique peptides are important to the protein inference problem 
and the impact of a degenerate peptide is determined by the number of times that the peptide is 
shared. In our experiments, we show that our concise model achieves competitive results with 
ProteinProphet. 